{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PAPEA Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PAPEA Pipeline Part 1\n",
    "\n",
    "Authors: Sebastian Haunss, Priska Daphi, Jan Matti Dollbaum, Lidiya Hristova, Pál Susánszky, Elias Steinhilper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the necessary modules\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This file describes all the steps needed to run the PAPEA pipeline on a set of newspaper articles stored in a csv file formatted like the example below. The scripts use a small example file with articles from the German newspaper \"taz - die tageszeitung\". In your own file, other columns may have different names, but the following columns must be named exactly as in the example:\n",
    "* **aid**: a unique article identifier\n",
    "* **date**: a column with the article publication date\n",
    "* **text**: the full text of the article including the title\n",
    "* **source**: the name of the newspaper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>aid</th>\n",
       "      <th>atype</th>\n",
       "      <th>source</th>\n",
       "      <th>section</th>\n",
       "      <th>date</th>\n",
       "      <th>title</th>\n",
       "      <th>subtitle</th>\n",
       "      <th>author</th>\n",
       "      <th>text</th>\n",
       "      <th>keyword</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5233161</td>\n",
       "      <td>Agentur</td>\n",
       "      <td>taz</td>\n",
       "      <td>Wirtschaft und Umwelt</td>\n",
       "      <td>28.09.2015</td>\n",
       "      <td>Greenpeace-Proteste an Shell-Tankstellen</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Greenpeace-Proteste an Shell-Tankstellen. NA. ...</td>\n",
       "      <td>protesti</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>242902</td>\n",
       "      <td>Agentur</td>\n",
       "      <td>taz</td>\n",
       "      <td>Seite 1</td>\n",
       "      <td>07.01.2015</td>\n",
       "      <td>Streit über Roman</td>\n",
       "      <td>ISLAM-BUCH Houellebecqs Vision eines Frankreic...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Streit über Roman. ISLAM-BUCH Houellebecqs Vis...</td>\n",
       "      <td>Demonstr</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>224306</td>\n",
       "      <td>TAZ-Bericht</td>\n",
       "      <td>taz</td>\n",
       "      <td>Kultur</td>\n",
       "      <td>16.02.2015</td>\n",
       "      <td>Lasst rote Flecken sprießen</td>\n",
       "      <td>BERLINALE Die 65. Internationalen Filmfestspie...</td>\n",
       "      <td>CRISTINA NORD</td>\n",
       "      <td>Lasst rote Flecken sprießen. BERLINALE Die 65....</td>\n",
       "      <td>Widerstand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5249499</td>\n",
       "      <td>TAZ-Bericht</td>\n",
       "      <td>taz</td>\n",
       "      <td>Sport</td>\n",
       "      <td>21.11.2015</td>\n",
       "      <td>Der Sport zählt fantastisch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Thomas Winkler</td>\n",
       "      <td>Der Sport zählt fantastisch. NA. Am Spieltag k...</td>\n",
       "      <td>protesti</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5210229</td>\n",
       "      <td>TAZ-Bericht</td>\n",
       "      <td>taz Berlin</td>\n",
       "      <td>Kultur</td>\n",
       "      <td>08.07.2015</td>\n",
       "      <td>Subversion des Wissens</td>\n",
       "      <td>Ideengeschichte Wer hat’s erfunden? In der Ber...</td>\n",
       "      <td>Cord Riechelmann</td>\n",
       "      <td>Subversion des Wissens. Ideengeschichte Wer ha...</td>\n",
       "      <td>Widerstand</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       aid        atype      source                section        date  \\\n",
       "0  5233161      Agentur         taz  Wirtschaft und Umwelt  28.09.2015   \n",
       "1   242902      Agentur         taz                Seite 1  07.01.2015   \n",
       "2   224306  TAZ-Bericht         taz                 Kultur  16.02.2015   \n",
       "3  5249499  TAZ-Bericht         taz                  Sport  21.11.2015   \n",
       "4  5210229  TAZ-Bericht  taz Berlin                 Kultur  08.07.2015   \n",
       "\n",
       "                                      title  \\\n",
       "0  Greenpeace-Proteste an Shell-Tankstellen   \n",
       "1                         Streit über Roman   \n",
       "2               Lasst rote Flecken sprießen   \n",
       "3               Der Sport zählt fantastisch   \n",
       "4                    Subversion des Wissens   \n",
       "\n",
       "                                            subtitle            author  \\\n",
       "0                                                NaN               NaN   \n",
       "1  ISLAM-BUCH Houellebecqs Vision eines Frankreic...               NaN   \n",
       "2  BERLINALE Die 65. Internationalen Filmfestspie...     CRISTINA NORD   \n",
       "3                                                NaN    Thomas Winkler   \n",
       "4  Ideengeschichte Wer hat’s erfunden? In der Ber...  Cord Riechelmann   \n",
       "\n",
       "                                                text     keyword  \n",
       "0  Greenpeace-Proteste an Shell-Tankstellen. NA. ...    protesti  \n",
       "1  Streit über Roman. ISLAM-BUCH Houellebecqs Vis...    Demonstr  \n",
       "2  Lasst rote Flecken sprießen. BERLINALE Die 65....  Widerstand  \n",
       "3  Der Sport zählt fantastisch. NA. Am Spieltag k...    protesti  \n",
       "4  Subversion des Wissens. Ideengeschichte Wer ha...  Widerstand  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "articles = pd.read_csv('../data/taz2015_sample.csv', encoding=\"utf8\")\n",
    "articles.head()"
   ]
  },
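  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running the pipeline on your own file, it can help to check that the four required columns are present and that `aid` is unique. The cell below is a minimal sketch, not part of the PAPEA scripts: the helper name `check_required_columns` and the toy DataFrame (including its deliberately misnamed `newspaper` column) are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Column names the pipeline expects, as listed above\n",
    "REQUIRED_COLUMNS = ['aid', 'date', 'text', 'source']\n",
    "\n",
    "\n",
    "def check_required_columns(df, required=REQUIRED_COLUMNS):\n",
    "    # Hypothetical helper: return the required columns missing from df\n",
    "    return [col for col in required if col not in df.columns]\n",
    "\n",
    "\n",
    "# Toy data standing in for your own file; 'newspaper' is deliberately misnamed\n",
    "toy = pd.DataFrame({\n",
    "    'aid': [1, 2],\n",
    "    'date': ['28.09.2015', '07.01.2015'],\n",
    "    'text': ['Erster Artikel.', 'Zweiter Artikel.'],\n",
    "    'newspaper': ['taz', 'taz'],\n",
    "})\n",
    "\n",
    "missing = check_required_columns(toy)\n",
    "print('Missing columns:', missing)\n",
    "\n",
    "# Renaming the misnamed column makes the check pass\n",
    "toy = toy.rename(columns={'newspaper': 'source'})\n",
    "assert check_required_columns(toy) == []\n",
    "\n",
    "# The article identifier must be unique\n",
    "assert toy['aid'].is_unique"
   ]
  },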
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Individual Steps of the PAPEA Pipeline\n",
    "\n",
    "This Jupyter Notebook provides the code for all steps of the PAPEA pipeline that are executed in Python. Additional steps follow in R."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Relevance Prediction\n",
    "The first Python script loads a dataset of newspaper articles from a csv file, each containing at least one protest keyword, and predicts whether each article actually reports on a protest or merely contains the keyword while reporting on something else.<br>\n",
    "In this step we also compute a slimmed-down version of the article's full text, containing only the sentences that include one of our protest keywords, plus the sentence immediately preceding and the sentence immediately following each of them."
   ]
  },
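  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The slimming idea described above can be illustrated with a short, self-contained sketch. It is a simplified illustration under stated assumptions, not the pipeline's own implementation: the keyword stems in `KEYWORDS`, the naive regex-based sentence splitter, and the function names `split_sentences` and `slim_text` are all hypothetical. A real run would likely use a proper sentence tokenizer and the project's actual keyword list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical keyword stems for illustration; the pipeline's actual list may differ\n",
    "KEYWORDS = ['protest', 'demonstr', 'widerstand']\n",
    "\n",
    "\n",
    "def split_sentences(text):\n",
    "    # Naive splitter: break after '.', '!' or '?' when followed by whitespace\n",
    "    return [s for s in re.split(r'(?<=[.!?])\\s+', text.strip()) if s]\n",
    "\n",
    "\n",
    "def slim_text(text, keywords=KEYWORDS, window=1):\n",
    "    # Keep every keyword sentence plus `window` sentences before and after it\n",
    "    sentences = split_sentences(text)\n",
    "    keep = set()\n",
    "    for i, sentence in enumerate(sentences):\n",
    "        if any(k in sentence.lower() for k in keywords):\n",
    "            start = max(0, i - window)\n",
    "            stop = min(len(sentences), i + window + 1)\n",
    "            keep.update(range(start, stop))\n",
    "    # Re-join the kept sentences in their original order\n",
    "    return ' '.join(sentences[i] for i in sorted(keep))\n",
    "\n",
    "\n",
    "example = ('Es regnete am Montag. Die Stadt war ruhig. '\n",
    "           'Am Abend gab es einen Protest vor dem Rathaus. '\n",
    "           'Die Polizei sperrte die Straße. Später spielte eine Band.')\n",
    "\n",
    "# Only the keyword sentence and its two neighbours should remain\n",
    "print(slim_text(example))"
   ]
  },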
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 591/591 [00:00<00:00, 645.88it/s]\n",
      "Predicting labels:  58%|█████▊    | 344/591 [00:06<00:04, 54.55text/s]\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[26], line 54\u001b[0m\n\u001b[1;32m     51\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m predictions, pred_scores\n\u001b[1;32m     53\u001b[0m \u001b[38;5;66;03m# Get predictions and scores\u001b[39;00m\n\u001b[0;32m---> 54\u001b[0m predicted_labels, predicted_scores \u001b[38;5;241m=\u001b[39m predict_labels_and_scores(texts, tokenizer, model)\n\u001b[1;32m     56\u001b[0m \u001b[38;5;66;03m# Add predictions and scores to the DataFrame and save to a new file\u001b[39;00m\n\u001b[1;32m     57\u001b[0m df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mpred_text\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m=\u001b[39m texts\n",
      "Cell \u001b[0;32mIn[26], line 39\u001b[0m, in \u001b[0;36mpredict_labels_and_scores\u001b[0;34m(texts, tokenizer, model)\u001b[0m\n\u001b[1;32m     36\u001b[0m inputs \u001b[38;5;241m=\u001b[39m {k: v\u001b[38;5;241m.\u001b[39mto(device) \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m inputs\u001b[38;5;241m.\u001b[39mitems()}  \u001b[38;5;66;03m# Move to GPU\u001b[39;00m\n\u001b[1;32m     38\u001b[0m \u001b[38;5;66;03m# Perform prediction\u001b[39;00m\n\u001b[0;32m---> 39\u001b[0m outputs \u001b[38;5;241m=\u001b[39m model(\u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39minputs)\n\u001b[1;32m     40\u001b[0m logits \u001b[38;5;241m=\u001b[39m outputs\u001b[38;5;241m.\u001b[39mlogits\n\u001b[1;32m     42\u001b[0m \u001b[38;5;66;03m# Convert logits to probabilities\u001b[39;00m\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:1327\u001b[0m, in \u001b[0;36mXLMRobertaForSequenceClassification.forward\u001b[0;34m(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)\u001b[0m\n\u001b[1;32m   1319\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124mr\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m   1320\u001b[0m \u001b[38;5;124;03mlabels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\u001b[39;00m\n\u001b[1;32m   1321\u001b[0m \u001b[38;5;124;03m    Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\u001b[39;00m\n\u001b[1;32m   1322\u001b[0m \u001b[38;5;124;03m    config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\u001b[39;00m\n\u001b[1;32m   1323\u001b[0m \u001b[38;5;124;03m    `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\u001b[39;00m\n\u001b[1;32m   1324\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m   1325\u001b[0m return_dict \u001b[38;5;241m=\u001b[39m return_dict \u001b[38;5;28;01mif\u001b[39;00m return_dict \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconfig\u001b[38;5;241m.\u001b[39muse_return_dict\n\u001b[0;32m-> 1327\u001b[0m outputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mroberta(\n\u001b[1;32m   1328\u001b[0m     input_ids,\n\u001b[1;32m   1329\u001b[0m     attention_mask\u001b[38;5;241m=\u001b[39mattention_mask,\n\u001b[1;32m   1330\u001b[0m     token_type_ids\u001b[38;5;241m=\u001b[39mtoken_type_ids,\n\u001b[1;32m   1331\u001b[0m     
position_ids\u001b[38;5;241m=\u001b[39mposition_ids,\n\u001b[1;32m   1332\u001b[0m     head_mask\u001b[38;5;241m=\u001b[39mhead_mask,\n\u001b[1;32m   1333\u001b[0m     inputs_embeds\u001b[38;5;241m=\u001b[39minputs_embeds,\n\u001b[1;32m   1334\u001b[0m     output_attentions\u001b[38;5;241m=\u001b[39moutput_attentions,\n\u001b[1;32m   1335\u001b[0m     output_hidden_states\u001b[38;5;241m=\u001b[39moutput_hidden_states,\n\u001b[1;32m   1336\u001b[0m     return_dict\u001b[38;5;241m=\u001b[39mreturn_dict,\n\u001b[1;32m   1337\u001b[0m )\n\u001b[1;32m   1338\u001b[0m sequence_output \u001b[38;5;241m=\u001b[39m outputs[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m   1339\u001b[0m logits \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclassifier(sequence_output)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:977\u001b[0m, in \u001b[0;36mXLMRobertaModel.forward\u001b[0;34m(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)\u001b[0m\n\u001b[1;32m    970\u001b[0m \u001b[38;5;66;03m# Prepare head mask if needed\u001b[39;00m\n\u001b[1;32m    971\u001b[0m \u001b[38;5;66;03m# 1.0 in head_mask indicate we keep the head\u001b[39;00m\n\u001b[1;32m    972\u001b[0m \u001b[38;5;66;03m# attention_probs has shape bsz x n_heads x N x N\u001b[39;00m\n\u001b[1;32m    973\u001b[0m \u001b[38;5;66;03m# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\u001b[39;00m\n\u001b[1;32m    974\u001b[0m \u001b[38;5;66;03m# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\u001b[39;00m\n\u001b[1;32m    975\u001b[0m head_mask \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mget_head_mask(head_mask, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconfig\u001b[38;5;241m.\u001b[39mnum_hidden_layers)\n\u001b[0;32m--> 977\u001b[0m encoder_outputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mencoder(\n\u001b[1;32m    978\u001b[0m     embedding_output,\n\u001b[1;32m    979\u001b[0m     attention_mask\u001b[38;5;241m=\u001b[39mextended_attention_mask,\n\u001b[1;32m    980\u001b[0m     head_mask\u001b[38;5;241m=\u001b[39mhead_mask,\n\u001b[1;32m    981\u001b[0m     encoder_hidden_states\u001b[38;5;241m=\u001b[39mencoder_hidden_states,\n\u001b[1;32m    982\u001b[0m     encoder_attention_mask\u001b[38;5;241m=\u001b[39mencoder_extended_attention_mask,\n\u001b[1;32m    983\u001b[0m     past_key_values\u001b[38;5;241m=\u001b[39mpast_key_values,\n\u001b[1;32m    
984\u001b[0m     use_cache\u001b[38;5;241m=\u001b[39muse_cache,\n\u001b[1;32m    985\u001b[0m     output_attentions\u001b[38;5;241m=\u001b[39moutput_attentions,\n\u001b[1;32m    986\u001b[0m     output_hidden_states\u001b[38;5;241m=\u001b[39moutput_hidden_states,\n\u001b[1;32m    987\u001b[0m     return_dict\u001b[38;5;241m=\u001b[39mreturn_dict,\n\u001b[1;32m    988\u001b[0m )\n\u001b[1;32m    989\u001b[0m sequence_output \u001b[38;5;241m=\u001b[39m encoder_outputs[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m    990\u001b[0m pooled_output \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpooler(sequence_output) \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpooler \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:632\u001b[0m, in \u001b[0;36mXLMRobertaEncoder.forward\u001b[0;34m(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)\u001b[0m\n\u001b[1;32m    621\u001b[0m     layer_outputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_gradient_checkpointing_func(\n\u001b[1;32m    622\u001b[0m         layer_module\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__call__\u001b[39m,\n\u001b[1;32m    623\u001b[0m         hidden_states,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    629\u001b[0m         output_attentions,\n\u001b[1;32m    630\u001b[0m     )\n\u001b[1;32m    631\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m--> 632\u001b[0m     layer_outputs \u001b[38;5;241m=\u001b[39m layer_module(\n\u001b[1;32m    633\u001b[0m         hidden_states,\n\u001b[1;32m    634\u001b[0m         attention_mask,\n\u001b[1;32m    635\u001b[0m         layer_head_mask,\n\u001b[1;32m    636\u001b[0m         encoder_hidden_states,\n\u001b[1;32m    637\u001b[0m         encoder_attention_mask,\n\u001b[1;32m    638\u001b[0m         past_key_value,\n\u001b[1;32m    639\u001b[0m         output_attentions,\n\u001b[1;32m    640\u001b[0m     )\n\u001b[1;32m    642\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m layer_outputs[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m    643\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m use_cache:\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:521\u001b[0m, in \u001b[0;36mXLMRobertaLayer.forward\u001b[0;34m(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)\u001b[0m\n\u001b[1;32m    509\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mforward\u001b[39m(\n\u001b[1;32m    510\u001b[0m     \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m    511\u001b[0m     hidden_states: torch\u001b[38;5;241m.\u001b[39mTensor,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    518\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Tuple[torch\u001b[38;5;241m.\u001b[39mTensor]:\n\u001b[1;32m    519\u001b[0m     \u001b[38;5;66;03m# decoder uni-directional self-attention cached key/values tuple is at positions 1,2\u001b[39;00m\n\u001b[1;32m    520\u001b[0m     self_attn_past_key_value \u001b[38;5;241m=\u001b[39m past_key_value[:\u001b[38;5;241m2\u001b[39m] \u001b[38;5;28;01mif\u001b[39;00m past_key_value \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m--> 521\u001b[0m     self_attention_outputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mattention(\n\u001b[1;32m    522\u001b[0m         hidden_states,\n\u001b[1;32m    523\u001b[0m         attention_mask,\n\u001b[1;32m    524\u001b[0m         head_mask,\n\u001b[1;32m    525\u001b[0m         output_attentions\u001b[38;5;241m=\u001b[39moutput_attentions,\n\u001b[1;32m    526\u001b[0m         past_key_value\u001b[38;5;241m=\u001b[39mself_attn_past_key_value,\n\u001b[1;32m    527\u001b[0m     )\n\u001b[1;32m    528\u001b[0m     attention_output \u001b[38;5;241m=\u001b[39m self_attention_outputs[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m    530\u001b[0m     
\u001b[38;5;66;03m# if decoder, the last output is tuple of self-attn cache\u001b[39;00m\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:457\u001b[0m, in \u001b[0;36mXLMRobertaAttention.forward\u001b[0;34m(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)\u001b[0m\n\u001b[1;32m    438\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mforward\u001b[39m(\n\u001b[1;32m    439\u001b[0m     \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m    440\u001b[0m     hidden_states: torch\u001b[38;5;241m.\u001b[39mTensor,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    446\u001b[0m     output_attentions: Optional[\u001b[38;5;28mbool\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m,\n\u001b[1;32m    447\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Tuple[torch\u001b[38;5;241m.\u001b[39mTensor]:\n\u001b[1;32m    448\u001b[0m     self_outputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mself(\n\u001b[1;32m    449\u001b[0m         hidden_states,\n\u001b[1;32m    450\u001b[0m         attention_mask,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    455\u001b[0m         output_attentions,\n\u001b[1;32m    456\u001b[0m     )\n\u001b[0;32m--> 457\u001b[0m     attention_output \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moutput(self_outputs[\u001b[38;5;241m0\u001b[39m], hidden_states)\n\u001b[1;32m    458\u001b[0m     outputs \u001b[38;5;241m=\u001b[39m (attention_output,) \u001b[38;5;241m+\u001b[39m self_outputs[\u001b[38;5;241m1\u001b[39m:]  \u001b[38;5;66;03m# add attentions if we output them\u001b[39;00m\n\u001b[1;32m    459\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m outputs\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:399\u001b[0m, in \u001b[0;36mXLMRobertaSelfOutput.forward\u001b[0;34m(self, hidden_states, input_tensor)\u001b[0m\n\u001b[1;32m    397\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, hidden_states: torch\u001b[38;5;241m.\u001b[39mTensor, input_tensor: torch\u001b[38;5;241m.\u001b[39mTensor) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m torch\u001b[38;5;241m.\u001b[39mTensor:\n\u001b[1;32m    398\u001b[0m     hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdense(hidden_states)\n\u001b[0;32m--> 399\u001b[0m     hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdropout(hidden_states)\n\u001b[1;32m    400\u001b[0m     hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mLayerNorm(hidden_states \u001b[38;5;241m+\u001b[39m input_tensor)\n\u001b[1;32m    401\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m hidden_states\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1734\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)  \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m   1735\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1736\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1742\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m   1743\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m   1744\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m   1745\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m   1746\u001b[0m         \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1747\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m forward_call(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1749\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m   1750\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/modules/dropout.py:70\u001b[0m, in \u001b[0;36mDropout.forward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m     69\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, \u001b[38;5;28minput\u001b[39m: Tensor) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Tensor:\n\u001b[0;32m---> 70\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m F\u001b[38;5;241m.\u001b[39mdropout(\u001b[38;5;28minput\u001b[39m, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mp, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtraining, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39minplace)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/torch/nn/functional.py:1425\u001b[0m, in \u001b[0;36mdropout\u001b[0;34m(input, p, training, inplace)\u001b[0m\n\u001b[1;32m   1422\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m p \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m0.0\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m p \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1.0\u001b[39m:\n\u001b[1;32m   1423\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdropout probability has to be between 0 and 1, but got \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mp\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m   1424\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m (\n\u001b[0;32m-> 1425\u001b[0m     _VF\u001b[38;5;241m.\u001b[39mdropout_(\u001b[38;5;28minput\u001b[39m, p, training) \u001b[38;5;28;01mif\u001b[39;00m inplace \u001b[38;5;28;01melse\u001b[39;00m _VF\u001b[38;5;241m.\u001b[39mdropout(\u001b[38;5;28minput\u001b[39m, p, training)\n\u001b[1;32m   1426\u001b[0m )\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
    "import torch.nn.functional as F\n",
    "from tqdm import tqdm  # Import tqdm for the progress bar\n",
    "\n",
    "from utils import reformat_df\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample.csv\"\n",
    "output_csv = \"../data/taz2015_sample_predicted.csv\"\n",
    "\n",
    "if __name__=='__main__':\n",
    "\n",
    "    # Check for GPU availability and set device\n",
    "    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "    # Load the CSV file\n",
    "    df = pd.read_csv(input_csv, encoding=\"utf8\")\n",
    "    texts = reformat_df(df, filter_size=1)[\"text_a\"].tolist()\n",
    "\n",
    "    # Load the tokenizer and model, and move the model to GPU if available\n",
    "    tokenizer = AutoTokenizer.from_pretrained('shaunss/xlmroberta-pea-relevance-de')\n",
    "    model = AutoModelForSequenceClassification.from_pretrained('shaunss/xlmroberta-pea-relevance-de')\n",
    "    model = model.to(device)  # Move model to GPU\n",
    "    model.eval()  # Set model to evaluation mode\n",
    "\n",
    "    # Predict labels and probabilities with a progress bar\n",
    "    def predict_labels_and_scores(texts, tokenizer, model):\n",
    "        predictions = []\n",
    "        pred_scores = []\n",
    "        with torch.no_grad():\n",
    "            for text in tqdm(texts, desc=\"Predicting labels\", unit=\"text\"):  # Add progress bar\n",
    "                # Tokenize the input text and move inputs to GPU\n",
    "                inputs = tokenizer(text, return_tensors=\"pt\", padding=True, truncation=True)\n",
    "                inputs = {k: v.to(device) for k, v in inputs.items()}  # Move to GPU\n",
    "            \n",
    "                # Perform prediction\n",
    "                outputs = model(**inputs)\n",
    "                logits = outputs.logits\n",
    "            \n",
    "                # Convert logits to probabilities\n",
    "                probs = F.softmax(logits, dim=1)\n",
    "            \n",
    "                # Get the predicted label and its probability score\n",
    "                predicted_label = torch.argmax(probs, dim=1).item()\n",
    "                predicted_score = probs[0][predicted_label].item()\n",
    "                predictions.append(predicted_label)\n",
    "                pred_scores.append(predicted_score)\n",
    "    \n",
    "        return predictions, pred_scores\n",
    "\n",
    "    # Get predictions and scores\n",
    "    predicted_labels, predicted_scores = predict_labels_and_scores(texts, tokenizer, model)\n",
    "\n",
    "    # Add predictions and scores to the DataFrame and save to a new file\n",
    "    df['pred_text'] = texts\n",
    "    df['pred_label'] = predicted_labels\n",
    "    df['pred_score'] = predicted_scores\n",
    "    df.to_csv(output_csv, index=False)\n",
    "\n",
    "    print(f\"Predictions with scores saved to {output_csv}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Split Documents into Sentences\n",
    "Since many of the following work on the sentence level, we split the documents into sentences befor we continue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reduced dataset with only relevant articles saved to ../data/taz2015_sample_relevant.csv\n",
      "Sentence dataset from relevant articles saved to ../data/taz2015_sample_relevant_sentences.csv\n"
     ]
    }
   ],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import pandas as pd\n",
    "import spacy\n",
    "\n",
    "spacy.require_gpu()\n",
    "# spacy.prefer_gpu()\n",
    "\n",
    "# Load the German language model\n",
    "nlp = spacy.load(\"de_core_news_sm\")\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample_predicted.csv\"\n",
    "output_documents_csv = \"../data/taz2015_sample_relevant.csv\"\n",
    "output_sentences_csv = \"../data/taz2015_sample_relevant_sentences.csv\"\n",
    "\n",
    "\n",
    "if __name__=='__main__':\n",
    "\n",
    "    df = pd.read_csv(input_csv, encoding=\"utf8\")\n",
    "    \n",
    "    # select only predeicted protest articles\n",
    "    df = df[df['pred_label'] == 1]\n",
    "\n",
    "    # save reduced dataset\n",
    "    df.to_csv(output_documents_csv, index=False)\n",
    "\n",
    "    print(f\"Reduced dataset with only relevant articles saved to {output_documents_csv}\")\n",
    "\n",
    "\n",
    "    # Function to extract sentence start and end positions and the sentences themselves\n",
    "    def extract_sentence_info(row):\n",
    "        doc = nlp(row['pred_text'])\n",
    "        sentences = [sent.text for sent in doc.sents]\n",
    "        sentence_positions = [(row['aid'], sent.text, sent.start_char, sent.end_char) for sent in doc.sents]\n",
    "        return sentence_positions\n",
    "    \n",
    "    # Apply the function to the DataFrame\n",
    "    sentence_info_list = df.apply(extract_sentence_info, axis=1).explode()\n",
    "    \n",
    "    # Create a new DataFrame\n",
    "    sentence_df = pd.DataFrame(list(sentence_info_list), columns=['aid', 'sentence', 'start', 'end'])\n",
    "    \n",
    "    # Concatenate AN and start columns\n",
    "    sentence_df['sid'] = sentence_df['aid'].astype(str) + '_' + sentence_df['start'].astype(str)\n",
    "    \n",
    "    # Rearrange columns\n",
    "    sentence_df = sentence_df[['aid', 'sid', 'sentence', 'start', 'end']]\n",
    "    \n",
    "    # save result\n",
    "    sentence_df.to_csv(output_sentences_csv, index=False)\n",
    "    print(f\"Sentence dataset from relevant articles saved to {output_sentences_csv}\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3a: Identify possible Claim Sentences\n",
    "Before running our claim classification model we first narrow down the selection of sentences on which this model is applied. For this step we use the MARDY claim detector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/shaunss/miniconda3/envs/ml_papea/lib/python3.11/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator LabelBinarizer from version 1.2.0 when using version 1.2.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
      "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
      "  warnings.warn(\n",
      "/home/shaunss/miniconda3/envs/ml_papea/lib/python3.11/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator MLPClassifier from version 1.2.0 when using version 1.2.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
      "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
      "  warnings.warn(\n",
      "Encoding sentences: 100%|██████████| 337/337 [00:06<00:00, 53.07it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predictions for claim sentences with scores saved to ../data/taz2015_sample_claim_candidate_sentences.csv\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import pickle\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "from tqdm import tqdm\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample_relevant_sentences.csv\"\n",
    "output_csv = \"../data/taz2015_sample_claim_candidate_sentences.csv\"\n",
    "\n",
    "\n",
    "### SentenceTransformer model\n",
    "model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')\n",
    "\n",
    "### Load Classifier\n",
    "import pickle\n",
    "filename = '../models/mardy_claim_detection_v1.sav'\n",
    "loaded_model = pickle.load(open(filename, 'rb'))\n",
    "\n",
    "### Batch size\n",
    "batch_size = 8\n",
    "\n",
    "if __name__=='__main__':\n",
    "\n",
    "    ### load sentences\n",
    "    protest = pd.read_csv(input_csv, encoding=\"utf8\")\n",
    "\n",
    "    ### calculate embeddings and predict claims\n",
    "    embeddings = []\n",
    "    \n",
    "    for i in tqdm(range(0, len(protest), batch_size), desc=\"Encoding sentences\"):\n",
    "       batch = protest['sentence'][i:i+batch_size].tolist()\n",
    "       batch_embeddings = model.encode(batch)\n",
    "       embeddings.extend(batch_embeddings)\n",
    "\n",
    "    protest['predicted'] = loaded_model.predict(embeddings)\n",
    "    protest['predicted_prob'] = [sublist[0] for sublist in loaded_model.predict_proba(embeddings)]\n",
    "    \n",
    "    # save predictions to file\n",
    "    protest.to_csv(output_csv, encoding=\"utf8\", index=False)\n",
    "\n",
    "    print(f\"Predictions for claim sentences with scores saved to {output_csv}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3b: Claim Classification\n",
    "We now run the claim classification model on the claim candidate sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Encoding sentences: 100%|██████████| 48/48 [00:01<00:00, 29.97it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Claim predictions with scores saved to ../data/taz2015_sample_claims_predicted.csv\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import torch\n",
    "import re\n",
    "import pickle\n",
    "from sentence_transformers import SentenceTransformer, util , SentencesDataset, losses, evaluation\n",
    "from sentence_transformers.readers import InputExample\n",
    "from torch.utils.data import DataLoader\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import accuracy_score\n",
    "import spacy\n",
    "import scipy\n",
    "from tqdm import tqdm\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample_claim_candidate_sentences.csv\"\n",
    "output_csv = \"../data/taz2015_sample_claims_predicted.csv\"\n",
    "\n",
    "if __name__=='__main__':\n",
    "\n",
    "    # read data\n",
    "    df_sentences = pd.read_csv(input_csv, encoding=\"utf8\")\n",
    "\n",
    "    # select only predeicted protest articles\n",
    "    df_sentences = df_sentences[df_sentences['predicted'] == True]\n",
    "\n",
    "\n",
    "    # set model\n",
    "    model_fine = SentenceTransformer('shaunss/protestclaims_gbert')\n",
    "\n",
    "    # load classifyer\n",
    "    import pickle\n",
    "    filename = '../models/protest_claim_classification_gbert.sav'\n",
    "    LR = pickle.load(open(filename, 'rb'))\n",
    "\n",
    "    ### calculate embeddings and predict claims\n",
    "    embeddings = []\n",
    "    \n",
    "    for i in tqdm(range(0, len(df_sentences), batch_size), desc=\"Encoding sentences\"):\n",
    "       batch = df_sentences['sentence'][i:i+batch_size].tolist()\n",
    "       batch_embeddings = model_fine.encode(batch)\n",
    "       embeddings.extend(batch_embeddings)\n",
    "\n",
    "    # predict categories\n",
    "    df_sentences[\"prediction_claim\"] = LR.predict(embeddings)\n",
    "    # save prediction scores\n",
    "    df_sentences[\"score_claim\"] = LR.predict_proba(embeddings)[:, 1]\n",
    "    # export as csv\n",
    "    df_sentences.to_csv(output_csv, index=False)\n",
    "\n",
    "    print(f\"Claim predictions with scores saved to {output_csv}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Form Classification\n",
    "For portest forms, we run the form classification model direcly on all sentences from the relevant text segments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Encoding sentences: 100%|██████████| 337/337 [00:06<00:00, 52.73it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Form predictions with scores saved to ../data/taz2015_sample_forms_predicted.csv\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import torch\n",
    "import re\n",
    "import pickle\n",
    "from sentence_transformers import SentenceTransformer, util , SentencesDataset, losses, evaluation\n",
    "from sentence_transformers.readers import InputExample\n",
    "from torch.utils.data import DataLoader\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import accuracy_score\n",
    "import spacy\n",
    "import scipy\n",
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample_relevant_sentences.csv\"\n",
    "output_csv = \"../data/taz2015_sample_forms_predicted.csv\"\n",
    "\n",
    "### Batch size\n",
    "batch_size = 8\n",
    "\n",
    "if __name__=='__main__':\n",
    "\n",
    "    # read sentences from protest articles\n",
    "    df_sentences = pd.read_csv(input_csv,  sep= \",\", encoding=\"utf8\")\n",
    "    df_sentences['sentence'] = df_sentences['sentence'].astype(str)\n",
    "\n",
    "\n",
    "    # set model\n",
    "    model_fine = SentenceTransformer('shaunss/protestforms_mpnet-base-v2')\n",
    "\n",
    "    # load classifyer\n",
    "    import pickle\n",
    "    filename = '../models/protest_form_classification_all_v1.sav'\n",
    "    LR = pickle.load(open(filename, 'rb'))\n",
    "\n",
    "    ### calculate embeddings and predict forms\n",
    "    embeddings = []\n",
    "    \n",
    "    for i in tqdm(range(0, len(df_sentences), batch_size), desc=\"Encoding sentences\"):\n",
    "       batch = df_sentences['sentence'][i:i+batch_size].tolist()\n",
    "       batch_embeddings = model_fine.encode(batch)\n",
    "       embeddings.extend(batch_embeddings)\n",
    "\n",
    "    # predict categories\n",
    "    df_sentences[\"prediction\"] = LR.predict(embeddings)\n",
    "    # save prediction scores\n",
    "    df_sentences[\"score\"] = LR.predict_proba(embeddings)[:, 1]\n",
    "    # save file\n",
    "    df_sentences.to_csv(output_csv, index=False)\n",
    "\n",
    "    print(f\"Form predictions with scores saved to {output_csv}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5: Location Classification\n",
    "FOr location classification we use a curated list of German cities, where ambiguous locations have been pruned. This works well for cities and villages in Germany. For locations outside Germany, either a localized version of \"loc_patterns_de.pickle\" has to be provided or a more generic location classifier like Mordecai can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running Spacy pipeline:  33%|███▎      | 90/271 [00:11<00:23,  7.66it/s]\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[23], line 34\u001b[0m\n\u001b[1;32m     32\u001b[0m \u001b[38;5;66;03m# apply NLP model to column pred_text\u001b[39;00m\n\u001b[1;32m     33\u001b[0m text \u001b[38;5;241m=\u001b[39m df\u001b[38;5;241m.\u001b[39mpred_text\u001b[38;5;241m.\u001b[39mastype(\u001b[38;5;28mstr\u001b[39m)\n\u001b[0;32m---> 34\u001b[0m docs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(tqdm(nlp\u001b[38;5;241m.\u001b[39mpipe(text), total\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mlen\u001b[39m(text), desc\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mRunning Spacy pipeline\u001b[39m\u001b[38;5;124m\"\u001b[39m))\n\u001b[1;32m     37\u001b[0m \u001b[38;5;66;03m# generate column with publication places\u001b[39;00m\n\u001b[1;32m     38\u001b[0m newspapers \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_excel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../data/newspaperlist.xlsx\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/tqdm/std.py:1178\u001b[0m, in \u001b[0;36mtqdm.__iter__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m   1175\u001b[0m time \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_time\n\u001b[1;32m   1177\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 1178\u001b[0m     \u001b[38;5;28;01mfor\u001b[39;00m obj \u001b[38;5;129;01min\u001b[39;00m iterable:\n\u001b[1;32m   1179\u001b[0m         \u001b[38;5;28;01myield\u001b[39;00m obj\n\u001b[1;32m   1180\u001b[0m         \u001b[38;5;66;03m# Update and possibly print the progressbar.\u001b[39;00m\n\u001b[1;32m   1181\u001b[0m         \u001b[38;5;66;03m# Note: does not call self.update(1) for speed optimisation.\u001b[39;00m\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/spacy/language.py:1618\u001b[0m, in \u001b[0;36mLanguage.pipe\u001b[0;34m(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)\u001b[0m\n\u001b[1;32m   1616\u001b[0m     \u001b[38;5;28;01mfor\u001b[39;00m pipe \u001b[38;5;129;01min\u001b[39;00m pipes:\n\u001b[1;32m   1617\u001b[0m         docs \u001b[38;5;241m=\u001b[39m pipe(docs)\n\u001b[0;32m-> 1618\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m doc \u001b[38;5;129;01min\u001b[39;00m docs:\n\u001b[1;32m   1619\u001b[0m     \u001b[38;5;28;01myield\u001b[39;00m doc\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/spacy/util.py:1685\u001b[0m, in \u001b[0;36m_pipe\u001b[0;34m(docs, proc, name, default_error_handler, kwargs)\u001b[0m\n\u001b[1;32m   1675\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_pipe\u001b[39m(\n\u001b[1;32m   1676\u001b[0m     docs: Iterable[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mDoc\u001b[39m\u001b[38;5;124m\"\u001b[39m],\n\u001b[1;32m   1677\u001b[0m     proc: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPipeCallable\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m   1682\u001b[0m     kwargs: Mapping[\u001b[38;5;28mstr\u001b[39m, Any],\n\u001b[1;32m   1683\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Iterator[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mDoc\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m   1684\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(proc, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpipe\u001b[39m\u001b[38;5;124m\"\u001b[39m):\n\u001b[0;32m-> 1685\u001b[0m         \u001b[38;5;28;01myield from\u001b[39;00m proc\u001b[38;5;241m.\u001b[39mpipe(docs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m   1686\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m   1687\u001b[0m         \u001b[38;5;66;03m# We added some args for pipe that __call__ doesn't expect.\u001b[39;00m\n\u001b[1;32m   1688\u001b[0m         kwargs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mdict\u001b[39m(kwargs)\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/spacy/pipeline/pipe.pyx:57\u001b[0m, in \u001b[0;36mpipe\u001b[0;34m()\u001b[0m\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/spacy/pipeline/entityruler.py:162\u001b[0m, in \u001b[0;36mEntityRuler.__call__\u001b[0;34m(self, doc)\u001b[0m\n\u001b[1;32m    160\u001b[0m error_handler \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mget_error_handler()\n\u001b[1;32m    161\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 162\u001b[0m     matches \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmatch(doc)\n\u001b[1;32m    163\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mset_annotations(doc, matches)\n\u001b[1;32m    164\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m doc\n",
      "File \u001b[0;32m~/miniconda3/envs/ml_papea/lib/python3.11/site-packages/spacy/pipeline/entityruler.py:172\u001b[0m, in \u001b[0;36mEntityRuler.match\u001b[0;34m(self, doc)\u001b[0m\n\u001b[1;32m    170\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m warnings\u001b[38;5;241m.\u001b[39mcatch_warnings():\n\u001b[1;32m    171\u001b[0m     warnings\u001b[38;5;241m.\u001b[39mfilterwarnings(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m, message\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\\\\u001b[39;00m\u001b[38;5;124m[W036\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 172\u001b[0m     matches \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmatcher(doc)) \u001b[38;5;241m+\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mphrase_matcher(doc))\n\u001b[1;32m    174\u001b[0m final_matches \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m(\n\u001b[1;32m    175\u001b[0m     [(m_id, start, end) \u001b[38;5;28;01mfor\u001b[39;00m m_id, start, end \u001b[38;5;129;01min\u001b[39;00m matches \u001b[38;5;28;01mif\u001b[39;00m start \u001b[38;5;241m!=\u001b[39m end]\n\u001b[1;32m    176\u001b[0m )\n\u001b[1;32m    177\u001b[0m get_sort_key \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mlambda\u001b[39;00m m: (m[\u001b[38;5;241m2\u001b[39m] \u001b[38;5;241m-\u001b[39m m[\u001b[38;5;241m1\u001b[39m], \u001b[38;5;241m-\u001b[39mm[\u001b[38;5;241m1\u001b[39m])\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import spacy\n",
    "import pandas as pd\n",
    "import pickle\n",
    "from tqdm import tqdm\n",
    "\n",
    "# define input and output files\n",
    "input_csv = \"../data/taz2015_sample_relevant.csv\"\n",
    "output_csv = \"../data/taz2015_sample_location_predicted.csv\"\n",
    "\n",
    "\n",
    "# load dataframe\n",
    "df = pd.read_csv(input_csv)\n",
    "\n",
    "# Unpickling\n",
    "# load curated localized pattern data\n",
    "with open(\"../data/loc_patterns_de.pickle\", \"rb\") as fp:\n",
    "   patterns = pickle.load(fp)\n",
    "\n",
    "# load and change NLP model\n",
    "\n",
    "# use only Spacy pipeline elements that are necessary\n",
    "spacy.require_gpu()\n",
    "# spacy.prefer_gpu()\n",
    "\n",
    "nlp = spacy.load(\"de_core_news_md\", disable=[\"tok2vec\", \"tagger\",'morphologizer', \"parser\", \"attribute_ruler\", \"lemmatizer\"])\n",
    "\n",
    "# add ruler and allow modifiaction of 'ents'\n",
    "ruler = nlp.add_pipe(\"entity_ruler\", config={\"overwrite_ents\":'true'})\n",
    "ruler.add_patterns(patterns)\n",
    "\n",
    "# apply NLP model to column pred_text\n",
    "text = df.pred_text.astype(str)\n",
    "docs = list(tqdm(nlp.pipe(text), total=len(text), desc=\"Running Spacy pipeline\"))\n",
    "\n",
    "\n",
    "# generate column with publication places\n",
    "newspapers = pd.read_excel('../data/newspaperlist.xlsx')\n",
    "\n",
    "newspapers.drop(columns=['newspaper', 'company', 'available_from', 'available_until'], inplace=True)\n",
    "\n",
    "df=df.join(newspapers.set_index('source'), on='source')\n",
    "\n",
    "# addf found names of cities and neighborhoods\n",
    "\n",
    "# append unique city names from pred_text\n",
    "cities_unique = []\n",
    "for i in range(len(docs)):\n",
    "  cities_unique.append([])\n",
    "  for ent in docs[i].ents:\n",
    "    if ent.label_ == 'ORT':\n",
    "      cities_unique[i].append(ent.text)\n",
    "\n",
    "cities_found=[]\n",
    "for i in cities_unique:\n",
    "  a = list(dict.fromkeys(i))\n",
    "  cities_found.append(a)\n",
    "\n",
    "df[\"cities_found\"] = cities_found\n",
    "\n",
    "# append unique city names that also can be names of neighborhoods\n",
    "nbh_unique = []\n",
    "for i in range(len(docs)):\n",
    "  nbh_unique.append([])\n",
    "  for ent in docs[i].ents:\n",
    "    if ent.label_ == 'ORT_TEIL':\n",
    "      nbh_unique[i].append(ent.text)\n",
    "\n",
    "city_places=[]\n",
    "for i in nbh_unique:\n",
    "  a = list(dict.fromkeys(i))\n",
    "  city_places.append(a)\n",
    "\n",
    "df[\"city_places_found\"] = city_places\n",
    "\n",
    "# append unique names of neighborhoods\n",
    "neighbourhood_unique = []\n",
    "for i in range(len(docs)):\n",
    "  neighbourhood_unique.append([])\n",
    "  for ent in docs[i].ents:\n",
    "    if ent.label_ == 'ORTSTEIL':\n",
    "      neighbourhood_unique[i].append(ent.text)\n",
    "\n",
    "neighbourhood=[]\n",
    "for i in neighbourhood_unique:\n",
    "  a = list(dict.fromkeys(i))\n",
    "  neighbourhood.append(a)\n",
    "\n",
    "df[\"neighbourhood_found\"] = neighbourhood\n",
    "\n",
    "\n",
    "# remove if nothing is found\n",
    "\n",
    "# no city name\n",
    "no_city = []\n",
    "for i in df.index:\n",
    "  if len(df.cities_found[i]) == 0:\n",
    "    no_city.append(i)\n",
    "\n",
    "no_city_found = df.loc[no_city]\n",
    "\n",
    "# also no ambiguous coty name found\n",
    "Po_t0 = []\n",
    "for i in no_city_found.index:\n",
    "  if len(no_city_found.city_places_found[i]) == 0:\n",
    "    Po_t0.append(i)\n",
    "\n",
    "no_city_temp_df = no_city_found.loc[Po_t0]\n",
    "\n",
    "# also no neignborhood\n",
    "no_nbh = []\n",
    "for i in no_city_temp_df.index:\n",
    "  if len(no_city_temp_df.neighbourhood_found[i]) == 0:\n",
    "    no_nbh.append(i)\n",
    "\n",
    "no_place_found = no_city_temp_df.loc[no_nbh]\n",
    "\n",
    "\n",
    "# drop cases where no location name was found\n",
    "df.drop(no_place_found.index, inplace=True)\n",
    "\n",
    "# Predict place\n",
    "\n",
    "# function for location names\n",
    "def place_pred(row):\n",
    "    if len(row['cities_found']) != 0:\n",
    "        val = row['cities_found']\n",
    "    elif len(row['city_places_found']) != 0:\n",
    "        val = row['city_places_found']\n",
    "    else:\n",
    "        val = str(row['place'])\n",
    "    return val\n",
    "\n",
    "# create new column with found places\n",
    "df['pred_place'] = df.apply(place_pred, axis=1)\n",
    "\n",
    "df.to_csv(output_csv)\n",
    "print(f\"Location predictions with scores saved to {output_csv}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 6: Date Identification\n",
    "Date identification and consolidation of the final dataset is done in R. The necessary code can be found in the file \"2_papea_pipeline_R.Rmd\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml_papea",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
